Measuring Massive Multitask Language Understanding
https://arxiv.org/abs/2009.03300
https://github.com/hendrycks/test
We propose a new test to measure a text model's multitask accuracy
57 tasks
Apparently all of them are four-choice questions
Think of them as exam-style questions
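A minimal sketch of how accuracy over four-choice questions across several tasks could be scored. The record layout (task, predicted letter, gold letter), the task names, and the unweighted mean over tasks are all assumptions made for illustration; they are not taken from the paper or the hendrycks/test repository. With four choices, random guessing gives about 25% accuracy.

```python
from collections import defaultdict

CHOICES = "ABCD"  # four answer options per question


def per_task_accuracy(records):
    """Return {task: accuracy} from (task, predicted, gold) triples.

    The triple layout is a hypothetical format for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        if gold not in CHOICES:
            raise ValueError(f"gold answer must be one of {CHOICES}: {gold!r}")
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


def macro_average(accuracies):
    """Unweighted mean over tasks (one possible aggregate, assumed here)."""
    return sum(accuracies.values()) / len(accuracies)


# Toy data with made-up task names; a real run would cover all 57 tasks.
records = [
    ("task_a", "A", "A"),
    ("task_a", "B", "C"),
    ("task_b", "D", "D"),
]
acc = per_task_accuracy(records)
overall = macro_average(acc)
print(acc, overall)
```

Averaging per task first keeps a task with many questions from dominating the overall number; averaging over all questions instead would be an equally reasonable choice.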